This notebook explores and models the data collected from recordings of the natural acoustic environment over the urban-rural gradient near Innsbruck, Austria. The models are implemented as Bayesian models with the PyMC3 probabilistic programming library.
References:
https://github.com/fonnesbeck/multilevel_modeling
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.
In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas
import numpy
In [3]:
%matplotlib inline
from matplotlib import pyplot
from matplotlib.patches import Rectangle
import seaborn
import mpld3
from mpld3 import plugins
In [4]:
from pymc3 import glm, Model, NUTS, sample, \
forestplot, traceplot, plot_posterior, summary, \
Normal, Uniform, Deterministic, StudentT, stats
Set the chart style to the seaborn defaults.
In [5]:
seaborn.set()
In [6]:
data_filepath = "/Users/Jake/OneDrive/Documents/alpine soundscapes/data/dataset.csv"
In [7]:
seaborn_blue = seaborn.color_palette()[0]
In [8]:
data = pandas.read_csv(data_filepath)
data = data.loc[data.site<=30]
In [9]:
data[['site', 'site_name', 'building_200m', 'forest_200m', 'd2n_200m']].head()
Out[9]:
Sort the data by site and then by visit.
In [10]:
data = data.sort_values(by=['site', 'sound']).reset_index(drop=True)
In [11]:
data[['site', 'site_name', 'biophony',
'forest_200m', 'd2n_200m', 'temperature', 'week']].head(n=10)
Out[11]:
Here is a pairwise plot of a subset of the data. The 'week' variable represents the number of weeks from the start of field collection (January 27, 2016) to when each recording was made. The 'd2n_200m' variable represents the average Distance to Nature (d2n) value, a landscape index representing the amount of human alteration, within a 200-meter radius of each recording location. The 'temperature' variable represents the temperature in degrees Celsius, averaged from the measurements taken at the University of Innsbruck weather station 30 minutes before and after each recording.
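As an aside, here is a minimal sketch of how such a windowed average could be computed with pandas; the raw weather-station table, its column names ('time', 'temperature'), and the recording start times are hypothetical stand-ins for illustration only, not the project's actual data files.
import pandas
# hypothetical 10-minute weather-station temperatures (degrees Celsius)
station = pandas.DataFrame({
    'time': pandas.date_range('2016-01-27 06:00', periods=12, freq='10min'),
    'temperature': [1.2, 1.4, 1.5, 1.9, 2.3, 2.6, 2.8, 3.1, 3.0, 3.2, 3.5, 3.7],
}).set_index('time')
# hypothetical recording start times
recordings = pandas.DataFrame({
    'site': [1, 2],
    'start': pandas.to_datetime(['2016-01-27 06:40', '2016-01-27 07:20']),
})
# average all station measurements within 30 minutes of each recording start
window = pandas.Timedelta(minutes=30)
recordings['temperature'] = [
    station.loc[t - window:t + window, 'temperature'].mean()
    for t in recordings.start
]
recordings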
In [12]:
pairplot = seaborn.pairplot(data[['biophony', 'sel',
'week', 'temperature', 'd2n_500m']])
There is definitely a relationship between biophony and land cover, although a more robust analysis might explain the relationship better than the simple regression shown below.
In [13]:
biophony_jointplot = seaborn.jointplot('d2n_100m', 'biophony', data=data, kind='reg',
xlim=(-0.1, 0.6), ylim=(0, 15))
Land cover seems to have a greater effect on anthrophony.
In [14]:
anthrophony_jointplot = seaborn.jointplot('d2n_100m', 'sel', data=data, kind='reg',
xlim=(-0.1, 0.6), ylim=(-70, -10))
As the boxplots show, repeated measurements of percent biophony and sel vary little at some sites and greatly at others.
In [15]:
biophony_boxplot = seaborn.boxplot('site', 'biophony', data=data)
biophony_boxplot.figure.set_figwidth(10)
In [16]:
anthrophony_boxplot = seaborn.boxplot('site', 'sel', data=data)
anthrophony_boxplot.figure.set_figwidth(10)
We can look at a standard regression analysis to compare it with more robust multilevel models. We start with a "complete pooling" analysis, considering only the 'week' variable (effectively time of year) as a predictor for biophony.
We start with the model $y_i = \alpha + \beta x_i + \epsilon_i$, where $y_i$ is the biophony at recording $i$, $\alpha$ is the intercept, $\beta$ is the slope, and $x_i$ is the number of weeks from the start of field collection for recording $i$.
We can define the model in PyMC3 and run it with 1000 samples.
In [17]:
with Model() as pooled_model:
glm.GLM.from_formula("biophony ~ week", data)
pooled_trace = sample(1000, NUTS(), random_seed=1)
In [18]:
p = plot_posterior(pooled_trace, varnames=['Intercept', 'week'])
# TODO: fix font style
Here are the mean parameter values (avoiding the burn-in period, or the first 500 samples).
In [19]:
intercept0 = pooled_trace['Intercept'][500:].mean()
slope0 = pooled_trace['week'][500:].mean()
In [20]:
intercept0
Out[20]:
In [21]:
slope0
Out[21]:
Here is a plot of the data and the fitted regression line.
In [22]:
fig, ax = pyplot.subplots(figsize=(15, 5))
ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
x_week = numpy.linspace(0, 100)
# plot 50 samples from the posterior of the parameters
for d in numpy.random.random_integers(500, 999, 50):
p1 = ax.plot(x_week, pooled_trace['Intercept'][d] + pooled_trace['week'][d] * x_week,
color=seaborn_blue, alpha=0.1)
# plot mean parameter values
p1 = ax.plot(x_week, intercept0 + slope0*x_week, 'k--')
l1 = ax.set_xlabel("weeks since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 14])
f1 = ax.set_ylim([0, 15])
The complete pooling model above does not allow the regression parameters to vary by site, which is likely unrealistic. For example, it seems probable that the change of biophony over time could vary between locations. We can see what happens if we fit a separate regression to each site (a different intercept (alpha) and slope (beta) for each site j).
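Written out, this no-pooling approach fits a separate regression to the recordings at each site $j$, with each $\alpha_j$ and $\beta_j$ estimated from that site's data alone:
$y_i = \alpha_j + \beta_j x_i + \epsilon_i$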
In [23]:
sample_sites = (2, 9, 11, 15, 17, 18, 26, 30)
In [24]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
# plot data and regression line
y = data.biophony[data.site==site]
x = data.week[data.site==site]
seaborn.regplot(x, y, ax=ax[i], ci=None, color='grey')
# plot complete pooling regression line and set format
x_week = numpy.linspace(0, 100)
ax[i].plot(x_week, (slope0 * x_week) + intercept0, 'r--')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].set_ylim([0, 15])
We can see from the no-pooling example that biophony does seem to change differently at each location. However, as shown below, a multilevel model allows us to apply a single model to the data set while allowing the regression parameters to vary between sites. For sites with fewer measurements, the regression parameters are pulled towards the parameters of the entire data set (towards complete pooling), while for sites with many measurements, the regression parameters are pulled less towards the entire data set (towards no pooling).
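Gelman and Hill (2006) give an approximate closed form for this compromise: the multilevel estimate of a site mean is a precision-weighted average of that site's sample mean and the grand mean, so the weight on a site's own data grows with its number of measurements $n_j$:
$\hat{\alpha}_j \approx \dfrac{\frac{n_j}{\sigma_y^2}\,\bar{y}_j + \frac{1}{\sigma_\alpha^2}\,\bar{y}_{\text{all}}}{\frac{n_j}{\sigma_y^2} + \frac{1}{\sigma_\alpha^2}}$
where $\sigma_y^2$ is the within-site variance and $\sigma_\alpha^2$ is the between-site variance.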
The model shown below contains no predictors (referred to as the empty model), and thus shows only how the mean biophony varies between locations:
$y_{ij} = \beta_j + r_{ij}$, with $\beta_j = \gamma + u_j$
In this equation, $y_{ij}$ represents the biophony for recording $i$ at location $j$, $\beta_j$ represents the mean biophony of location $j$, $\gamma$ represents the grand mean of biophony (the mean of the location means), $u_j$ represents the difference between the mean biophony at location $j$ and the grand mean biophony, and $r_{ij}$ represents the difference between the biophony of recording $i$ at location $j$ and the mean biophony at location $j$. The model can also be thought of as defining a separate intercept for each site with no slope.
We can define this model in PyMC3 as shown below.
NOTE: While the model equations are shown in frequentist form, PyMC3 implements the models in a Bayesian context.
In [25]:
sites = numpy.copy(data.site.values) - 1
In [26]:
with Model() as partial_pooling:
# Priors
mu_a = Normal('mu_a', mu=0., tau=0.0001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
# Random intercepts
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a[sites]
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
In [27]:
with partial_pooling:
step = NUTS()
partial_pooling_samples = sample(2000, step, random_seed=1)
We can plot both the unpooled means and the partially-pooled means.
In [28]:
sample_trace = partial_pooling_samples['a'][-1000:]
fig, ax = pyplot.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)
samples, sites = sample_trace.shape
jitter = numpy.random.normal(scale=0.1, size=sites)
n_site = data.groupby('site')['biophony'].count()
unpooled_means = data.groupby('site')['biophony'].mean()
unpooled_sd = data.groupby('site')['biophony'].std()
unpooled = pandas.DataFrame({'n':n_site, 'm':unpooled_means, 'sd':unpooled_sd})
unpooled['se'] = unpooled.sd / numpy.sqrt(unpooled.n)
# plot unpooled means
ax[0].scatter(unpooled.n + jitter, unpooled.m, marker='o', color='black')
for j, row in zip(jitter, unpooled.iterrows()):
name, dat = row
ax[0].plot([dat.n+j,dat.n+j], [dat.m-dat.se, dat.m+dat.se],
linestyle='-', color='black', alpha=0.25)
t1 = ax[0].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
ax[0].set_xticks([3, 4, 5])
t2 = ax[0].set_title('unpooled means')
t3 = ax[0].set_xlabel('number of measurements (jittered)')
t4 = ax[0].set_ylabel('biophony')
# plot partially-pooled means
samples, sites = sample_trace.shape
means = sample_trace.mean(axis=0)
sd = sample_trace.std(axis=0)
ax[1].scatter(n_site.values + jitter, means, marker='o', color='purple')
ax[1].set_xlim(2, 6)
ax[1].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
for j,n,m,s in zip(jitter, n_site.values, means, sd):
ax[1].plot([n+j]*2, [m-s, m+s], linestyle='-', color='purple', alpha=0.25)
t1 = ax[1].set_title('partially-pooled means')
t2 = ax[1].set_xlabel('number of measurements (jittered)')
t3 = ax[1].set_ylabel('biophony')
Note that there is less pooling for sites with a higher number of measurements.
Going forward with the multilevel model and partial pooling, we can begin to add predictors. This model adds the 'week' variable as a predictor but does not allow its effect (slope) to vary between locations; the intercepts are still allowed to vary between locations, though without a group-level predictor.
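In the notation used above, this is a varying-intercept model with a common slope, matching the 'a', 'b', 'mu_a', and 'sigma_a' variables in the PyMC3 code below:
$y_{ij} = \alpha_j + \beta x_{ij} + \epsilon_{ij}, \qquad \alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2)$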
In [29]:
sites = numpy.copy(data.site.values) - 1
In [30]:
with Model() as varying_intercept:
# Priors
mu_a = Normal('mu_a', mu=0., tau=0.0001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
# Random intercepts
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# Common slope
b = Normal('b', mu=0., tau=0.0001)
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a[sites] + b * data.week
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
In [31]:
with varying_intercept:
step = NUTS()
varying_intercept_samples = sample(2000, step, random_seed=1)
In [32]:
site_labels = ["{1} - {0}".format(s,
data.site_name[data.site == s].unique()[0])\
for s in range(1, 31)]
pyplot.figure(figsize=(5, 8))
p1 = forestplot(varying_intercept_samples, varnames=['a'],
ylabels=site_labels,
xtitle="biophony (intercept)")
We can also plot the sample traces and distributions for other variables defined in the model. Here 'sigma_a' represents the standard deviation of the location-level mean biophony values around the grand mean, and 'b' represents the common slope for all locations.
In [33]:
p1 = traceplot(varying_intercept_samples[-1000:], varnames=['sigma_a', 'b'])
We can also print a summary with credible intervals for each model variable. The summary below shows that the estimated slope ('b') is 0.259, meaning that for each one-week increase the biophony value increases by about 0.259 (for example, roughly 2.6 over ten weeks).
In [34]:
summary(varying_intercept_samples[-1000:], varnames=['b'])
Now we can plot a regression line for each location.
In [35]:
fig, ax = pyplot.subplots(figsize=(15, 5))
x_week = numpy.arange(0, 110, 10)
bp = varying_intercept_samples['a'].mean(axis=0)
mp = varying_intercept_samples['b'].mean()
ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
for bi in bp:
ax.plot(x_week, mp * x_week + bi, color='purple', marker=None, alpha=0.1)
l1 = ax.set_xlabel("week since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 14])
f2 = ax.set_ylim([-5, 15])
Here are separate plots for each location, showing three different regressions.
In [36]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
    # Plot site data and unpooled regression
y = data.biophony[data.site==site]
x = data.week[data.site==site]
seaborn.regplot(x, y, ax=ax[i], ci=None, color='grey')
# plot pooled and partially-pooled regressions
x_week = numpy.linspace(0, 14)
# Pooled estimate
ax[i].plot(x_week, slope0 * x_week + intercept0, 'r--')
    # Partial pooling estimate
ax[i].plot(x_week, mp * x_week + bp[site - 1], color='purple', marker=None, linestyle='-')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].set_ylim([-5, 15])
We can also see what happens if we fix a common intercept, but allow the slope to vary between locations.
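This corresponds to a common-intercept, varying-slope model (in the code below, the intercept and the site slopes are given uniform priors rather than the normal priors used so far):
$y_{ij} = \alpha + \beta_j x_{ij} + \epsilon_{ij}$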
In [37]:
sites = numpy.copy(data.site.values) - 1
In [38]:
with Model() as varying_slope:
# Priors
# mu_b = Normal('mu_b', mu=0., tau=0.0001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
# Model intercept
# a = Normal('a', mu=0., tau=0.0001)
a = Uniform('a', lower=-10, upper=10)
# Random slopes
# b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
b = Uniform('b', lower=-10, upper=10, shape=len(set(sites)))
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a + b[sites] * data.week
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
In [39]:
with varying_slope:
step = NUTS()
varying_slope_samples = sample(2000, step, random_seed=1)
Here is a plot of the estimated slope for each location.
In [40]:
site_labels = ["{1} - {0}".format(s,
data.site_name[data.site == s].unique()[0])\
for s in range(1, 31)]
pyplot.figure(figsize=(5, 8))
p1 = forestplot(varying_slope_samples, varnames=['b'],
ylabels=site_labels,
xtitle="slope")
In [42]:
vs_trace = varying_slope_samples[:1500]
traceplot(vs_trace, varnames=['b'])
Out[42]:
This result is interesting, as locations with an estimated slope near 0 (e.g., Wilten and Templstraße) are generally located in areas with mostly urban land cover, and other locations with higher estimated slopes (e.g., Pumhof and Wittenberg) are generally located in areas with a large amount of forest land cover.
Here is a plot of the regression lines for all the locations.
In [43]:
fig, ax = pyplot.subplots(figsize=(15, 5))
# all data
p1 = ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
# varying slopes
x_week = numpy.arange(0, 16, 1)
b = varying_slope_samples['a'].mean()
m = varying_slope_samples['b'].mean(axis=0)
for mi in m:
pi = ax.plot(x_week, mi * x_week + b, marker=None, color=seaborn_blue, alpha=0.1)
# pooled
p2 = ax.plot(x_week, intercept0 + slope0 * x_week, marker=None, color='black', linestyle='--')
# formatting
l1 = ax.set_xlabel("week since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 15])
Now we can try varying both slope and intercept.
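That is, both the intercept and the slope get a site index and a shared normal prior:
$y_{ij} = \alpha_j + \beta_j x_{ij} + \epsilon_{ij}, \qquad \alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2), \quad \beta_j \sim N(\mu_\beta, \sigma_\beta^2)$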
In [44]:
sites = numpy.copy(data.site.values) - 1
In [45]:
with Model() as varying_intercept_slope:
# Priors
mu_a = Normal('mu_a', mu=0., tau=0.0001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
mu_b = Normal('mu_b', mu=0., tau=0.0001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
# Random intercepts
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# Random slopes
b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
# Model error
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# Expected value
y_hat = a[sites] + b[sites] * data.week
# Data likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony)
y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)
In [46]:
with varying_intercept_slope:
step = NUTS()
varying_intercept_slope_samples = sample(2000, step, random_seed=1)
In [47]:
fig, ax = pyplot.subplots(figsize=(15, 5))
# all data
p1 = ax.scatter(data.week, data.biophony, marker='.', color='black', alpha=0.5)
# varying slopes and intercepts
x_week = numpy.arange(0, 16, 1)
b = varying_intercept_slope_samples['a'].mean(axis=0)
m = varying_intercept_slope_samples['b'].mean(axis=0)
for bi,mi in zip(b,m):
pi = ax.plot(x_week, mi * x_week + bi, marker=None, color=seaborn_blue, alpha=0.1)
# pooled
p2 = ax.plot(x_week, intercept0 + slope0 * x_week, marker=None, color='black', linestyle='--')
# formatting
l1 = ax.set_xlabel("week since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 15])
Here are plots for individual locations again, showing how this model (blue sample lines) compares with the unpooled regression for each site (black).
In [48]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
# Plot pooled and partially-pooled regressions
x_week = numpy.linspace(0, 16)
# Pooled estimate
ax[i].plot(x_week, slope0 * x_week + intercept0, 'r-', alpha=0.5)
    # Partial pooling estimate
ax[i].plot(x_week, mp * x_week + bp[site - 1], color='purple', marker=None, linestyle='-', alpha=0.5)
# Varying slopes and intercepts
for d in numpy.random.random_integers(500, 1999, 100):
ax[i].plot(x_week,
(varying_intercept_slope_samples['b'][d][site - 1] * x_week) + varying_intercept_slope_samples['a'][d][site - 1],
color=seaborn_blue, marker=None, alpha=0.1)
# Plot site data and unpooled regression
y = data.biophony[data.site==site]
x = data.week[data.site==site]
seaborn.regplot(x, y, ax=ax[i], ci=None, color='black', line_kws={'linestyle': '--', 'linewidth': 0.5})
ax[i].plot(x_week, m[site - 1] * x_week + b[site - 1], color='black', marker=None, linestyle='--')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].set_xlim([0, 15])
ax[i].set_ylim([-2, 16])
In [49]:
trace = varying_intercept_slope_samples[-1500:]
a_means = trace['a'].mean(axis=0)
b_means = trace['b'].mean(axis=0)
Now we can see whether we can assign a predictor to the level 2 (group, or location-level) parameters. Land cover is likely to be a significant predictor of the amount of biophony (intercept) and of how it changes over time (slope).
There are many predictor variables in the data set, but we can define an aggregate predictor from those likely to have the largest effect. We define the 'forest_200m' variable as the percentage of forest land cover minus the combined percentage of building and paved-surface land cover within a 200-meter radius of each recording location; a sketch of this calculation follows the table below. We can later check the fit of the model with other predictor variables.
In [50]:
data[['site', 'd2n_200m', 'forest_200m']].groupby('site').mean().head()
Out[50]:
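Here is a minimal sketch of how such an aggregate land cover variable could be derived; the per-class percentage columns ('forest_pct_200m', 'building_pct_200m', 'pavement_pct_200m') and their values are hypothetical names used only for illustration, not columns of dataset.csv.
# hypothetical per-class land cover percentages within a 200-meter radius
land_cover = pandas.DataFrame({
    'site': [1, 2, 3],
    'forest_pct_200m': [80.0, 35.0, 5.0],
    'building_pct_200m': [2.0, 20.0, 45.0],
    'pavement_pct_200m': [3.0, 15.0, 40.0],
})
# net forest: forest percentage minus built (building + paved) percentage
land_cover['forest_200m'] = (land_cover.forest_pct_200m
                             - (land_cover.building_pct_200m
                                + land_cover.pavement_pct_200m))
land_cover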
Here is the model, this time defined in Bayesian form:
$y_{ij} \sim N(\alpha_j + \beta_j x_{ij},\ \sigma_y^2)$
$\alpha_j \sim N(\gamma_{00} + \gamma_{01} u_j,\ \sigma_\alpha^2)$
$\beta_j \sim N(\gamma_{10} + \gamma_{11} u_j,\ \sigma_\beta^2)$
where $x_{ij}$ is the measurement-level week indicator and $u_j$ is the location-level land cover indicator.
We transform and center some of the variables to make it easier to interpret the model results.
In [51]:
data['biophony_centered'] = data.biophony - data.biophony.mean()
data['weeks_centered'] = data.week - data.week.mean()
data['forest_200m_centered'] = data.forest_200m - data.forest_200m.mean()
data['d2n_200m_centered'] = data.d2n_200m - data.d2n_200m.mean()
In [52]:
sites = numpy.copy(data.site.values) - 1
In [53]:
with Model() as model_3:
# intercept
g_00 = Normal('g_00', mu=0, tau=0.001)
g_01 = Normal('g_01', mu=0, tau=0.001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
mu_a = g_00 + (g_01 * data.forest_200m_centered.unique())
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# slope
g_10 = Normal('g_10', mu=0, tau=0.001)
g_11 = Normal('g_11', mu=0, tau=0.001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
mu_b = g_10 + (g_11 * data.forest_200m_centered.unique())
b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
# model error (data-level)
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# expected values
y_hat = a[sites] + (b[sites] * data.weeks_centered)
# likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony_centered)
# simulated
#y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)
In [54]:
with model_3:
step = NUTS()
model_3_samples = sample(10000, step, random_seed=1)
We can compute the estimated intercept and slope for each location based on the mean of the last 5000 samples from the model trace.
In [55]:
trace = model_3_samples[-5000:]
a_means = trace['a'].mean(axis=0)
b_means = trace['b'].mean(axis=0)
Now we can compare how well the estimated level 2 parameters predict the estimated level 1 parameters.
In [56]:
fig, ax = pyplot.subplots(1, 2, sharex=True, figsize=(15, 5))
# plot estimated level 2 parameter (intercept)
intercept_points = ax[0].scatter(data.forest_200m_centered.unique(), a_means, color='black', alpha=0.4)
g00 = trace['g_00'].mean()
g01 = trace['g_01'].mean()
x_forest = numpy.linspace(-70, 70)
p1 = ax[0].plot(x_forest, g00 + (g01 * x_forest), color='black', linestyle='--')
pl = ax[0].set_xlim(-60, 60)
# plot estimated level 1 parameters
a_se = trace['a'].std(axis=0)
for xi, m, se in zip(data.forest_200m_centered.unique(), a_means, a_se):
pi = ax[0].plot([xi, xi], [m-se, m+se], color='black', alpha=0.1)
t1 = ax[0].set_xlabel('net forest within a 200 meter radius')
t2 = ax[0].set_ylabel('intercept estimate (difference from mean biophony)')
t0 = ax[0].set_title('predicting intercept estimates')
# for interactive plot via mpld3
#intercept_point_labels = ["{0}: {1:0.2f}".format(n, s) for n, s in zip(data.site_name.unique(), a_means)]
#tooltip = plugins.PointLabelTooltip(intercept_points, intercept_point_labels)
#plugins.connect(fig, tooltip)
#mpld3.display(fig)
# plot estimated level 2 parameter (slope)
slope_points = ax[1].scatter(data.forest_200m_centered.unique(), b_means, color='black', alpha=0.4)
g10 = trace['g_10'].mean()
g11 = trace['g_11'].mean()
x_forest = numpy.linspace(-70, 70)
p2 = ax[1].plot(x_forest, g10 + (g11 * x_forest), color='black', linestyle='--')
pl = ax[1].set_xlim(-60, 60)
# plot estimated level 1 parameters
b_se = trace['b'].std(axis=0)
for xi, m, se in zip(data.forest_200m_centered.unique(), b_means, b_se):
pi = ax[1].plot([xi, xi], [m-se, m+se], color='black', alpha=0.1)
t3 = ax[1].set_xlabel('net forest within a 200 meter radius')
t4 = ax[1].set_ylabel('slope estimate')
t5 = ax[1].set_title('predicting slope estimates')
# for interactive plot via mpld3
#slope_point_labels = ["{0}: {1:0.2f}".format(n, s) for n, s in zip(data.site_name.unique(), b_means)]
#tooltip = plugins.PointLabelTooltip(slope_points, slope_point_labels)
#plugins.connect(fig, tooltip)
#mpld3.display(fig)
Above, we can see that the land cover variable (forest_200m) predicts reasonably well how much biophony changes at each location over time (right plot). The intercept values do not fit quite as well, although there still seems to be a clear relationship (left plot). The two major outliers may be caused by errors in the source separation algorithm. We can compute an R squared value for each level 2 parameter and for all of level 1 (see below).
First, here are individual plots for all the locations.
In [57]:
fig, ax = pyplot.subplots(6, 5, figsize=(15, 20), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(data.site.unique()):
x_week = numpy.linspace(-10, 10, 2)
# draw parameter samples
for d in numpy.random.random_integers(0, 4999, 100):
ax[i].plot(x_week,
(trace['b'][d][site - 1] * x_week) + trace['a'][d][site - 1],
color=seaborn_blue, marker=None, alpha=0.1)
# observed data (points)
y = data.biophony_centered[data.site==site]
x = data.weeks_centered[data.site==site]
ax[i].scatter(x, y, color='black')
# model estimate (line)
    x_week = numpy.linspace(-10, 10, 2)
ax[i].plot(x_week, a_means[i] + (b_means[i] * x_week), color='black', linestyle='--')
ax[i].set_title("{0} - {1}".format(site, data.site_name[data.site == site].unique()[0]))
ax[i].text(-6, 12, "slope: {0:.02f}".format(b_means[i]), va='bottom')
ax[i].set_xlim([-7, 7])
ax[i].set_ylim([-10, 15])
As defined in Gelman and Hill 2006, we can compute (effective) R squared values for the multilevel model.
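As implemented in the next cell, each value is one minus the ratio of residual variance to total variance at that level:
$R^2 \approx 1 - \dfrac{\mathrm{Var}(e)}{\mathrm{Var}(\theta)}$
where $\theta$ is the quantity being modeled at that level (the centered biophony values at level 1; the intercept or slope estimates at level 2) and $e$ is the corresponding vector of prediction errors.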
In [58]:
# level 1
a_means_expanded = numpy.array([a_means[v-1] for v in data.site])
b_means_expanded = numpy.array([b_means[v-1] for v in data.site])
y_hat = a_means_expanded + (b_means_expanded * data.weeks_centered)
e_y = data.biophony_centered - y_hat
r2_level1 = 1 - (e_y.var() / data['biophony_centered'].var())
# level 2
a_hat = g00 + (g01 * data.groupby('site')['forest_200m'].mean())
b_hat = g10 + (g11 * data.groupby('site')['forest_200m'].mean())
e_a = a_means - a_hat
e_b = b_means - b_hat
# R squared for intercept estimates
r2_level2a = 1 - (e_a.var() / a_means.var())
# R squared for slope estimates
r2_level2b = 1 - (e_b.var() / b_means.var())
In [59]:
print("R squared values:\n\
level 1\n\
----------\n\
overall: {0:0.2f}\n\
\n\
level 2\n\
----------\n\
intercepts: {1:0.2f}\n\
slopes: {2:0.2f}".format(r2_level1, r2_level2a, r2_level2b))
In [60]:
trace.varnames
Out[60]:
In [61]:
p1 = forestplot(trace, varnames=['sigma_a', 'sigma_b', 'sigma_y'])
In [62]:
data['temperature_centered'] = data.temperature - data.temperature.mean()
In [63]:
data['wind_speed_centered'] = data.wind_speed - data.wind_speed.mean()
In [64]:
sites = numpy.copy(data.site.values) - 1
In [68]:
with Model() as model_4:
# intercept
g_00 = Normal('g_00', mu=0, tau=0.001)
g_01 = Normal('g_01', mu=0, tau=0.001)
sigma_a = Uniform('sigma_a', lower=0, upper=100)
tau_a = sigma_a**-2
mu_a = g_00 + (g_01 * data.forest_200m_centered.unique())
a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
# slope (land cover)
g_10 = Normal('g_10', mu=0, tau=0.001)
g_11 = Normal('g_11', mu=0, tau=0.001)
sigma_b = Uniform('sigma_b', lower=0, upper=100)
tau_b = sigma_b**-2
mu_b = g_10 + (g_11 * data.forest_200m_centered.unique())
b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
# temperature
c = Uniform('c', lower=-10, upper=10, shape=len(set(sites)))
# model error (data-level)
sigma_y = Uniform('sigma_y', lower=0, upper=100)
tau_y = sigma_y**-2
# expected values
    y_hat = a[sites] + (b[sites] * data.weeks_centered) + (c[sites] * data.temperature_centered)
# likelihood
y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data.biophony_centered)
# simulated
#y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)